Search CORE

24 research outputs found

Recommended from our members

Natural-language video description with deep recurrent neural networks

Author: Venugopalan Subhashini
Publication venue
Publication date: 13/12/2017
Field of study

For most people, watching a brief video and describing what happened (in words) is an easy task. For machines, extracting meaning from video pixels and generating a sentence description is a very complex problem. The goal of this thesis is to develop models that can automatically generate natural language descriptions for events in videos. It presents several approaches to automatic video description by building on recent advances in “deep” machine learning. The techniques presented in this thesis view the task of video description akin to machine translation, treating the video domain as a source “language” and uses deep neural net architectures to “translate” videos to text. Specifically, I develop video captioning techniques using a unified deep neural network with both convolutional and recurrent structure, modeling the temporal elements in videos and language with deep recurrent neural networks. In my initial approach, I adapt a model that can learn from paired images and captions to transfer knowledge from this auxiliary task to generate descriptions for short video clips. Next, I present an end-to-end deep network that can jointly model a sequence of video frames and a sequence of words. To further improve grammaticality and descriptive quality, I also propose methods to integrate linguistic knowledge from plain text corpora. Additionally, I show that such linguistic knowledge can help describe novel objects unseen in paired image/video-caption data. Finally, moving beyond short video clips, I present methods to process longer multi-activity videos, specifically to jointly segment and describe coherent event sequences in movies.Computer Science

Texas ScholarWorks

Sequence to Sequence -- Video to Text

Author: Darrell Trevor
Donahue Jeff
Mooney Raymond
Rohrbach Marcus
Saenko Kate
Venugopalan Subhashini
Publication venue
Publication date: 19/10/2015
Field of study

Real-world videos often have complex dynamics; and methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem, we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip. Our model naturally is able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e. a language model. We evaluate several variants of our model that exploit different visual features on a standard set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).Comment: ICCV 2015 camera-ready. Includes code, project page and LSMDC challenge result

arXiv.org e-Print Archive

Crossref

Generating Video Descriptions with Topic Guidance

Author: Ba Lei Jimmy
Hinton Geoffrey
Lebret Rémi
Lin Chin-Yew
Venugopalan Subhashini
Publication venue
Publication date: 04/09/2017
Field of study

Generating video descriptions in natural language (a.k.a. video captioning) is a more challenging task than image captioning as the videos are intrinsically more complicated than images in two aspects. First, videos cover a broader range of topics, such as news, music, sports and so on. Second, multiple topics could coexist in the same video. In this paper, we propose a novel caption model, topic-guided model (TGM), to generate topic-oriented descriptions for videos in the wild via exploiting topic information. In addition to predefined topics, i.e., category tags crawled from the web, we also mine topics in a data-driven way based on training captions by an unsupervised topic mining model. We show that data-driven topics reflect a better topic schema than the predefined topics. As for testing video topic prediction, we treat the topic mining model as teacher to train the student, the topic prediction model, by utilizing the full multi-modalities in the video especially the speech modality. We propose a series of caption models to exploit topic guidance, including implicitly using the topics as input features to generate words related to the topic and explicitly modifying the weights in the decoder with topics to function as an ensemble of topic-aware language decoders. Our comprehensive experimental results on the current largest video caption dataset MSR-VTT prove the effectiveness of our topic-guided model, which significantly surpasses the winning performance in the 2016 MSR video to language challenge.Comment: Appeared at ICMR 201

arXiv.org e-Print Archive

Crossref

Assessing ASR Model Quality on Disordered Speech using BERTScore

Author: Cave Richard
Li Qisheng
Seaver Katie
Tobin Jimmy
Tomanek Katrin
Venugopalan Subhashini
Publication venue
Publication date: 21/09/2022
Field of study

Word Error Rate (WER) is the primary metric used to assess automatic speech recognition (ASR) model quality. It has been shown that ASR models tend to have much higher WER on speakers with speech impairments than typical English speakers. It is hard to determine if models can be be useful at such high error rates. This study investigates the use of BERTScore, an evaluation metric for text generation, to provide a more informative measure of ASR model quality and usefulness. Both BERTScore and WER were compared to prediction errors manually annotated by Speech Language Pathologists for error type and assessment. BERTScore was found to be more correlated with human assessment of error type and assessment. BERTScore was specifically more robust to orthographic changes (contraction and normalization errors) where meaning was preserved. Furthermore, BERTScore was a better fit of error assessment than WER, as measured using an ordinal logistic regression and the Akaike's Information Criterion (AIC). Overall, our findings suggest that BERTScore can complement WER when assessing ASR model performance from a practical perspective, especially for accessibility applications where models are useful even at lower accuracy than for typical speech.Comment: Accepted to Interspeech 2022 Workshop on Speech for Social Goo

arXiv.org e-Print Archive